Abstract

The Minimum Description Length (MDL) principle selects the model that
has the shortest code for data plus model. We show that for a countable
class of models, MDL predictions are close to the true distribution in a strong
sense. The result is completely general. No independence, ergodicity,
stationarity, identifiability, or other assumption on the model class needs to be made.
More formally, we show that for any countable class of models, the distributions
selected by MDL (or MAP) asymptotically predict (merge with) the true
measure in the class in total variation distance. Implications for non-i.i.d.
domains like time-series forecasting, discriminative learning, and reinforcement
learning are discussed.
1 Introduction

The minimum description length (MDL) principle recommends using, among
competing models, the one that allows data+model to be compressed most [Grü07]. The
better the compression, the more regularity has been detected, hence the better the
predictions will be. The MDL principle can be regarded as a formalization of Ockham's
razor, which says to select the simplest model consistent with the data.

Multistep lookahead sequential prediction. We consider sequential prediction
problems, i.e. having observed sequence $x \equiv (x_1,x_2,...,x_\ell) \equiv x_{1:\ell}$, predict
$z \equiv (x_{\ell+1},...,x_{\ell+h}) \equiv x_{\ell+1:\ell+h}$, then observe $x_{\ell+1}\in\mathcal{X}$ for
$\ell \equiv \ell(x) = 0,1,2,...$. Classical prediction is concerned with $h=1$, multi-step
lookahead with $1<h<\infty$, and total prediction with $h=\infty$. In this paper we consider
the last, hardest case. An infamous problem in this category is the Black raven
paradox [Mah04, Hut07]: Having observed $\ell$ black ravens, what is the likelihood that
all ravens are black? A problem closer to computer science is (infinite horizon)
reinforcement learning, where predicting the infinite future is necessary for
evaluating a policy. See Section 6 for these and other applications.

Discrete MDL and Bayes. Let $\mathcal{M} = \{Q_1,Q_2,...\}$ be a countable class of
models=theories=hypotheses=probabilities over sequences $\mathcal{X}^\infty$, sorted w.r.t. their
complexity=codelength $K(Q_i) = 2\log_2 i$ (say), containing the unknown true sampling
distribution $P$. Our main result will be for arbitrary measurable spaces $\mathcal{X}$, but to
keep things simple in the introduction, let us illustrate MDL for finite $\mathcal{X}$.

In this case, we define $Q_i(x)$ as the $Q_i$-probability of data sequence $x\in\mathcal{X}^\ell$. It
is possible to code $x$ in $\log P(x)^{-1}$ bits, e.g. by using Huffman coding. Since $x$ is
sampled from $P$, this code is optimal (shortest among all prefix codes). Since we
do not know $P$, we could select the $Q\in\mathcal{M}$ that leads to the shortest code on the
observed data $x$. In order to be able to reconstruct $x$ from the code we need to
know which $Q$ has been chosen, so we also need to code $Q$, which takes $K(Q)$ bits.
Hence $x$ can be coded in $\min_{Q\in\mathcal{M}}\{-\log Q(x) + K(Q)\}$ bits. MDL selects as model
the minimizer

$\mathrm{MDL}^x := \arg\min_{Q\in\mathcal{M}} \{-\log Q(x) + K(Q)\}$
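
The selection rule can be illustrated with a small sketch. It assumes a hypothetical finite class of i.i.d. Bernoulli models (the parameter list `thetas` and the function names are illustrative and not part of the paper) and the codelength $K(Q_i) = 2\log_2 i$ mentioned above.

```python
import math

def neg_log2_prob(theta, x):
    """-log2 Q(x) for an i.i.d. Bernoulli(theta) model Q and a binary sequence x."""
    total = 0.0
    for bit in x:
        p = theta if bit == 1 else 1.0 - theta
        if p == 0.0:
            return math.inf  # Q assigns probability zero to x
        total -= math.log2(p)
    return total

def mdl_select(thetas, x):
    """Return the 1-based index i minimizing -log2 Q_i(x) + K(Q_i), K(Q_i) = 2*log2(i)."""
    best_i, best_len = None, math.inf
    for i, theta in enumerate(thetas, start=1):
        length = neg_log2_prob(theta, x) + 2.0 * math.log2(i)
        if length < best_len:
            best_i, best_len = i, length
    return best_i

# Hypothetical model class: Q_1..Q_4 are Bernoulli with these parameters.
thetas = [0.5, 0.25, 0.75, 0.9]
```

With little data the model penalty dominates and the simplest model is chosen; with more data the likelihood term takes over.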

Given $x$, the true predictive probability of $z$ is $P(z|x) = P(xz)/P(x)$. Since $P$
is unknown we use $\mathrm{MDL}^x(z|x) := \mathrm{MDL}^x(xz)/\mathrm{MDL}^x(x)$ as a substitute. Our main
concern is how close the latter is to the former. We can measure the distance between
two predictive distributions by

$d_h(P,Q|x) = \sum_{z\in\mathcal{X}^h} |P(z|x) - Q(z|x)|$     (1)

for $h<\infty$ and $d_\infty = \lim_{h\to\infty} d_h = \sup\{d_1,d_2,...\}$. It is easy to see that $d_h$ is
monotone increasing in $h$ and that $d_\infty$ is twice the total variation distance (tvd)
defined in (3).
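
As a small numerical illustration of the monotonicity of $d_h$, the following sketch evaluates the distance for two i.i.d. Bernoulli measures. This is an assumed toy setting: for i.i.d. models the predictive distribution does not depend on $x$, so the conditioning is dropped, and the function name `d_h` and the parameters 0.7 and 0.5 are chosen only for illustration.

```python
from math import comb

def d_h(p, q, h):
    """Sum over z in {0,1}^h of |P(z) - Q(z)| for i.i.d. Bernoulli(p) and Bernoulli(q).

    Sequences with the same number k of ones have equal probability, so they
    are grouped with multiplicity comb(h, k).
    """
    return sum(
        comb(h, k) * abs(p**k * (1 - p)**(h - k) - q**k * (1 - q)**(h - k))
        for k in range(h + 1)
    )

# d_1, d_2, ..., d_30 for P = Bernoulli(0.7) and Q = Bernoulli(0.5)
distances = [d_h(0.7, 0.5, h) for h in range(1, 31)]
```

The values grow with $h$ and stay below 2, in line with $d_\infty$ being twice a total variation distance.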

MDL is closely related to Bayesian prediction, so a comparison to existing results
for Bayes is interesting. Bayesians use $\mathrm{Bayes}(z|x)$ for prediction, where
$\mathrm{Bayes}(x) := \sum_{Q\in\mathcal{M}} Q(x) w_Q$ is the Bayesian mixture with prior weights
$w_Q > 0\ \forall Q\in\mathcal{M}$ and $\sum_{Q\in\mathcal{M}} w_Q = 1$. A natural choice is $w_Q \propto 2^{-K(Q)}$.
Results. The following results can be shown

$\sum_{\ell=0}^\infty E[d_h(P,\mathrm{MDL}^x|x)] \le 21\,h\cdot 2^{K(P)}$,    $d_\infty(P,\mathrm{MDL}^x|x) \to 0$ almost surely for $\ell(x)\to\infty$

$\sum_{\ell=0}^\infty E[d_h(P,\mathrm{Bayes}|x)] \le h\cdot\ln w_P^{-1}$,    $d_\infty(P,\mathrm{Bayes}|x) \to 0$ almost surely for $\ell(x)\to\infty$

(2)

where the expectation $E$ is w.r.t. $P[\cdot|x]$. The left statements for $h<\infty$ imply
$d_h \to 0$ almost surely, including some form of convergence rate. For Bayes it has
been proven in [Hut03]; for MDL the proof in [PH05] can be adapted. As far as
asymptotics is concerned, the right results $d_\infty \to 0$ are much stronger, and require
more sophisticated proof techniques. For Bayes, the result follows from [BD62].
The proof for MDL is the primary novel contribution of this paper; more precisely
for arbitrary measurable $\mathcal{X}$ in total variation distance. Another general consistency
result is presented in [Grü07, Thm.5.1]. Consistency is shown (only) in probability
and the predictive implications of the result are unclear. A stronger almost sure
result is alluded to, but the given reference to [BC91] contains only results for i.i.d.
sequences which do not generalize to arbitrary classes. So existing results for discrete
MDL are far less satisfactory than the elegant Bayesian prediction in tvd.

Motivation. The results above hold for completely arbitrary countable model
classes $\mathcal{M}$. No independence, ergodicity, stationarity, identifiability, or other
assumption needs to be made.

The bulk of previous results for MDL are for continuous model classes [Grü07].
Much has been shown for classes of independent identically distributed (i.i.d.)
random variables [BC91, Grü07]. Many results naturally generalize to
stationary-ergodic sequences like ($k$th-order) Markov. For instance, asymptotic consistency
has been shown in [Bar85]. There are many applications violating these
assumptions, some of them are presented below and in Section 6.

One can often hear the exaggerated claim that (e.g. unlike Bayes) MDL can be
used even if the true distribution $P$ is not in $\mathcal{M}$. Indeed, it can be used, but the
question is whether this is any good. There are some results supporting this claim,
e.g. if $P$ is in the closure of $\mathcal{M}$, but similar results exist for Bayes. Essentially $P$
needs to be at least close to some $Q\in\mathcal{M}$ for MDL to work, and there are interesting
environments that are not even close to being stationary-ergodic or i.i.d.

Non-i.i.d. data is pervasive [AHRU09]; it includes all time-series prediction
problems like weather forecasting and stock market prediction [CBL06]. Indeed, these
are also perfect examples of non-ergodic processes. Too much greenhouse gas, a
massive volcanic eruption, an asteroid impact, or another world war could change
the climate/economy irreversibly. Life is also not ergodic; one inattentive second
in a car can have irreversible consequences. Also stationarity is easily violated in
multi-agent scenarios: An environment which itself contains a learning agent is
non-stationary (during the relevant learning phase). Extensive games and multi-agent
reinforcement learning are classical examples [WR04].

Often it is assumed that the true distribution can be uniquely identified
asymptotically. For non-ergodic environments, asymptotic distinguishability can depend
on the realized observations, which prevents a prior reduction or partitioning of $\mathcal{M}$.
Even if possible in principle, it can be practically burdensome to do so, e.g. in the
presence of approximate symmetries. Indeed this problem is the primary reason for
considering predictive MDL. MDL might never identify the true distribution, but
our main result shows that the sequentially selected models become predictively
indistinguishable.

The countability of $\mathcal{M}$ is the severest restriction of our result. Nevertheless
the countable case is useful. A semi-parametric problem class $\bigcup_{d=1}^\infty \mathcal{M}_d$ with
$\mathcal{M}_d = \{Q_{\theta,d} : \theta\in\mathbb{R}^d\}$ (say) can be reduced to a countable class $\mathcal{M} = \{P_d\}$ for which
our result holds, where $P_d$ is a Bayes or NML or other estimate of $\mathcal{M}_d$ [Grü07].
Alternatively, $\bigcup_d \mathcal{M}_d$ could be reduced to a countable class by considering only
computable parameters $\theta$. Essentially all interesting model classes contain such a
countable topologically dense subset. Under certain circumstances MDL still works
for the non-computable parameters [Grü07]. Alternatively one may simply reject
non-computable parameters on philosophical grounds [Hut05]. Finally, the
techniques for the countable case might aid proving general results for continuous $\mathcal{M}$,
possibly along the lines of [Rya09].

Contents. The paper is organized as follows: In Section 2 we provide some insights
into how MDL and Bayes work in restricted settings, what breaks down for general
countable $\mathcal{M}$, and how to circumvent the problems. The formal development starts
with Section 3, which introduces notation and our main result. The proof for finite
$\mathcal{M}$ is presented in Section 4 and for denumerable $\mathcal{M}$ in Section 5. In Section 6
we show how the result can be applied to sequence prediction, classification and
regression, discriminative learning, and reinforcement learning. Section 7 discusses
some MDL variations.

2 Facts, Insights, Problems 

Before starting with the formal development, we describe how MDL and Bayes work
in some restricted settings, what breaks down for general countable $\mathcal{M}$, and how to
circumvent the problems. For deterministic environments, MDL reduces to learning
by elimination, and the four results in (2) can easily be understood. Consistency of
MDL for i.i.d. (and stationary-ergodic) sources is also intelligible. For general $\mathcal{M}$,
MDL may no longer converge to the true model. We have to give up the idea of
model identification, and concentrate on predictive performance.

Deterministic MDL = elimination learning. For a countable class
$\mathcal{M} = \{Q_1,Q_2,...\}$ of deterministic theories=models=hypotheses=sequences, sorted w.r.t.
their complexity=codelength $K(Q_i) = 2\log_2 i$ (say) it is easy to see why MDL
works: Each $Q$ is a model for one infinite sequence $x^Q_{1:\infty}$, i.e. $Q(x^Q) = 1$. Given the
true observations $x \equiv x^P_{1:\ell}$ so far, MDL selects the simplest $Q$ consistent with $x^P_{1:\ell}$ and
for $h=1$ predicts $x^Q_{\ell+1}$. This (and potentially other) $Q$ becomes (forever) inconsistent
if and only if the prediction was wrong. Assume the true model is $P = Q_m$. Since
elimination occurs in order of increasing index $i$, and $Q_m$ never makes any error,
MDL makes at most $m-1$ prediction errors. Indeed, what we have described is just
classical Gold style learning by elimination. For $1<h<\infty$, the prediction $x^Q_{\ell+1:\ell+h}$
may be wrong only on $x^Q_{\ell+h}$, which causes $h$ wrong predictions before the error is
revealed. (Note that at time $\ell$ only $x^P_\ell$ is revealed.) Hence the total number of errors
is bounded by $h\cdot(m-1)$. The bound is for instance attained on the class consisting
of $Q_i = 1^{ih}0^\infty$, where the true sequence switches from 1 to 0 after having observed $m\cdot h$
ones. For $h=\infty$, a wrong prediction gets eventually revealed. Hence each wrong $Q_i$
($i<m$) gets eventually eliminated, i.e. $P$ gets eventually selected. So for $h=\infty$ we
can (still/only) show that the number of errors is finite. No bound on the number of
errors in terms of $m$ only is possible. For instance, for $\mathcal{M} = \{Q_1 = 1^\infty, Q_2 = P = 1^n0^\infty\}$,
it takes $n$ time steps to reveal that prediction $1^\infty$ is wrong, and $n$ can be chosen
arbitrarily large.
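
A minimal simulation of the $h=1$ case may help make the $m-1$ error bound concrete. It assumes the example class $Q_i = 1^i 0^\infty$ (the $h=1$ instance of the class above); the function names are illustrative only.

```python
def hypothesis(i):
    """Q_i = 1^i 0^infinity: returns the symbol at 0-based position t."""
    return lambda t: 1 if t < i else 0

def count_errors(m, num_models, horizon):
    """Run one-step (h=1) elimination learning when the truth is P = Q_m.

    At each step the lowest-index hypothesis consistent with all past
    observations predicts the next symbol. Returns the number of wrong
    predictions made within the given horizon.
    """
    models = [hypothesis(i) for i in range(1, num_models + 1)]
    truth = models[m - 1]
    consistent = list(range(num_models))  # still-consistent indices, simplest first
    errors = 0
    for t in range(horizon):
        prediction = models[consistent[0]](t)
        observed = truth(t)
        if prediction != observed:
            errors += 1
        # eliminate every hypothesis contradicted by the new observation
        consistent = [j for j in consistent if models[j](t) == observed]
    return errors
```

On this class each $Q_i$ with $i<m$ causes exactly one error before it is eliminated, so the bound $m-1$ is attained.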

Deterministic Bayes = majority learning. Bayesian learning is at the same
time closely related to and very different from MDL. Bayes predicts with a
$w$-weighted average of the models (rather than with a single one). For a deterministic
class, Bayes is similar to prediction by majority: Consider the models consistent with
the true observation $x^P_{1:\ell}$, having total weight $W$, and take the weighted majority
prediction (this is the Bayes-optimal decision under 0-1 loss, Bayesian prediction
would randomize). For $h=1$, making a wrong prediction means that $Q$'s contributing
to at least half of the total weight $W$ get eliminated. Since $P = Q_m$ never gets
eliminated, we have $w_P \le W \le 2^{-\#\mathrm{Errors}}$; hence the number of errors is bounded by
$\log_2 w_P^{-1}$. For probabilistic Bayesian prediction proper, it is also easy to see that the
expected number of errors is bounded by $\ln w_P^{-1}$. One can show that these bounds
are essentially sharp (e.g. for $Q_i$ defined as the digits after the comma of the binary
expansion of $(i-1)/2^n$ for $i = 1..m$ and $m = 2^n-1$). With the same reasoning as in
the MDL case, for $h>1$ we have to multiply the bound by $h$; and for $h=\infty$ we get
correct prediction eventually, but no explicit bound anymore.
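
The weighted-majority argument can be checked on a toy class. The sketch below assumes the hypothetical class $Q_i = 1^i 0^\infty$ with prior weights $w_i \propto 1/i^2$ normalized over a finite class; names and parameters are illustrative, and ties are broken in favor of predicting 0.

```python
import math

def majority_errors(m, num_models, horizon):
    """Weighted-majority prediction on Q_i = 1^i 0^infinity with truth P = Q_m.

    Prior weights are w_i proportional to 1/i^2. Returns (errors, bound)
    where bound = log2(1/w_P).
    """
    raw = [1.0 / i**2 for i in range(1, num_models + 1)]
    norm = sum(raw)
    weights = [r / norm for r in raw]
    symbol = lambda i, t: 1 if t < i else 0  # symbol of Q_i at 0-based position t
    alive = set(range(1, num_models + 1))    # models consistent with the data so far
    errors = 0
    for t in range(horizon):
        weight_one = sum(weights[i - 1] for i in alive if symbol(i, t) == 1)
        weight_zero = sum(weights[i - 1] for i in alive if symbol(i, t) == 0)
        prediction = 1 if weight_one > weight_zero else 0
        observed = symbol(m, t)
        if prediction != observed:
            errors += 1  # at least half of the surviving weight is eliminated
        alive = {i for i in alive if symbol(i, t) == observed}
    return errors, math.log2(1.0 / weights[m - 1])
```

Each error removes at least half of the surviving weight while $P$ itself survives, which is why the error count cannot exceed $\log_2 w_P^{-1}$.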

Comparison of deterministic$\leftrightarrow$probabilistic and MDL$\leftrightarrow$Bayes. The flavor
of results carries over to some extent to the probabilistic case. On a very abstract
level even the line of reasoning carries over, although this is deeply buried in the
sophisticated mathematical analysis of the latter. So the special deterministic case
illustrates the more complex probabilistic case. For instance for $h=1$ and $w_i \propto 1/i^2$,
we see that "Bayes" makes only $2\log_2 m$ errors, while MDL can make up to
$m$ errors. This carries over to the probabilistic case. Also the multiplier $h$ for
$1<h<\infty$ and the lack of an explicit bound for $h=\infty$ carry over. Cf. the bounds
in (2). The reader is invited to reveal other relations not explicitly mentioned here.
The differences are as follows: In the probabilistic case, the true $P$ can in general
not be identified anymore. Further, while the Bayesian bound trivially follows from
the half-century-old classical merging of opinions result [BD62], the corresponding
MDL bound we prove in this paper is more difficult to obtain.

Consistency of MDL for stationary-ergodic sources. For an i.i.d. class $\mathcal{M}$,
the law of large numbers applied to the random variables $Z_t := \log[P(x_t)/Q(x_t)]$
implies $\frac{1}{\ell}\sum_{t=1}^\ell Z_t \to \mathrm{KL}(P\|Q) := \sum_{x_1} P(x_1)\log[P(x_1)/Q(x_1)]$ with $P$-probability 1.
Either the Kullback-Leibler (KL) divergence is zero, which is the case if and only
if $P = Q$, or $\log P(x_{1:\ell}) - \log Q(x_{1:\ell}) \equiv \sum_{t=1}^\ell Z_t \sim \mathrm{KL}(P\|Q)\,\ell \to \infty$, i.e. asymptotically
MDL does not select $Q$. For countable $\mathcal{M}$, a refinement of this argument shows that
MDL eventually selects $P$ [BC91]. This reasoning can be extended to
stationary-ergodic $\mathcal{M}$, but essentially not beyond. To see where the limitation comes from, we
present some troubling examples.
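
The law-of-large-numbers step can be illustrated numerically. The following sketch assumes two Bernoulli distributions and natural logarithms; the function names, the seed, and the parameters are illustrative choices, not taken from the paper.

```python
import math
import random

def kl_bernoulli(p, q):
    """KL(P||Q) in nats for Bernoulli(p) and Bernoulli(q), with 0 < p, q < 1."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def average_log_ratio(p, q, n, seed=0):
    """Sample x_1..x_n i.i.d. from Bernoulli(p); return (1/n) sum_t log[P(x_t)/Q(x_t)]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = 1 if rng.random() < p else 0
        total += math.log((p if x else 1 - p) / (q if x else 1 - q))
    return total / n
```

For $P \ne Q$ the empirical average settles near the positive KL divergence, so the log-likelihood difference grows roughly linearly in $\ell$ and eventually outweighs any fixed complexity difference $K(P)-K(Q)$.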

Trouble makers. For instance, let $P$ be a Bernoulli($\theta_0$) process, but let the
$Q$-probability that $x_t = 1$ be $\theta_t$, i.e. time-dependent (still assuming independence). For
a suitably converging but "oscillating" (i.e. infinitely often larger and smaller than
its limit) sequence $\theta_t \to \theta_0$ one can show that $\log[P(x_{1:t})/Q(x_{1:t})]$ converges to but
oscillates around $K(Q)-K(P)$ w.p.1, i.e. there are non-stationary distributions for
which MDL does not converge (not even to a wrong distribution).

One idea to solve this problem is to partition $\mathcal{M}$, where two distributions are in
the same partition if and only if they are asymptotically indistinguishable (like $P$ and
$Q$ above), and then ask MDL to only identify a partition. This approach cannot
succeed generally, whatever particular criterion is used, for the following reason:
Let $P(x_1) > 0\ \forall x_1$. For $x_1 = 1$, let $P$ and $Q$ be asymptotically indistinguishable, e.g.
$P = Q$ on the remainder of the sequence. For $x_1 = 0$, let $P$ and $Q$ be asymptotically
distinguishable distributions, e.g. different Bernoullis. This shows that for
non-ergodic sources like this one, asymptotic distinguishability depends on the drawn
sequence. The first observation can lead to totally different futures.

Predictive MDL avoids trouble. The Bayesian posterior does not need to
converge to a single (true or other) distribution, in order for prediction to work. We can
do something similar for MDL. At each time we still select a single distribution, but
give up the idea of identifying a single distribution asymptotically. We just measure
predictive success, and accept infinite oscillations. That's the approach taken in this
paper.

3 Notation and Main Result 

The formal development starts with this section. We need probability measures 
and filters for infinite sequences, conditional probabilities and densities, the total 
variation distance, and the concept of merging (of opinions), in order to formally 
state our main result. 

Measures on sequences. Let $(\Omega := \mathcal{X}^\infty, \mathcal{F}, P)$ be the space of infinite sequences
with natural filtration and product $\sigma$-field $\mathcal{F}$ and probability measure $P$. Let $\omega\in\Omega$
be an infinite sequence sampled from the true measure $P$. Except when mentioned
otherwise, all probability statements and expectations refer to $P$, e.g. almost surely
(a.s.) and with probability 1 (w.p.1) are short for with $P$-probability 1 (w.P.p.1).
Let $x = x_{1:\ell} = \omega_{1:\ell}$ be the first $\ell$ symbols of $\omega$.

For countable $\mathcal{X}$, the probability that an infinite sequence starts with $x$ is
$P(x) := P[\{x\}\times\mathcal{X}^\infty]$. The conditional distribution of an event $A$ given $x$ is
$P[A|x] := P[A\cap(\{x\}\times\mathcal{X}^\infty)]/P(x)$, which exists w.p.1. For other probability measures $Q$ on $\Omega$, we
define $Q(x)$ and $Q[A|x]$ analogously. General $\mathcal{X}$ are considered at the end of this
section.

Convergence in total variation. $P$ is said to be absolutely continuous relative to
$Q$, written

$P \ll Q \ :\Leftrightarrow\ [Q[A] = 0$ implies $P[A] = 0$ for all $A\in\mathcal{F}]$

$P$ and $Q$ are said to be mutually singular, written $P\perp Q$, iff there exists an $A\in\mathcal{F}$
for which $P[A] = 1$ and $Q[A] = 0$. The total variation distance (tvd) between $Q$ and
$P$ given $x$ is defined as

$d(P,Q|x) := \sup_{A\in\mathcal{F}} |Q[A|x] - P[A|x]|$     (3)

$Q$ is said to predict $P$ in tvd (or merge with $P$) if $d(P,Q|x) \to 0$ for $\ell(x)\to\infty$
with $P$-probability 1. Note that this in particular implies, but is stronger than
one-step predictive on- and off-sequence convergence
$Q(x_{\ell+1} = a_{\ell+1}|x_{1:\ell}) - P(x_{\ell+1} = a_{\ell+1}|x_{1:\ell}) \to 0$ for any $a$, not necessarily equal to $\omega$ [KL94].
The famous Blackwell and
Dubins convergence result [BD62] states that if $P$ is absolutely continuous relative
to $Q$, then (and only then [KL94]) $Q$ merges with $P$:

If $P \ll Q$ then $d(P,Q|x) \to 0$ w.p.1 for $\ell(x)\to\infty$
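
For distributions on a finite outcome set, the supremum over events in (3) can be computed by brute force and compared with half the $L_1$ distance, which is why the summed distance $d_\infty$ is twice the tvd. The sketch below is a finite toy illustration with hypothetical function names; it is not the general measure-theoretic definition.

```python
from itertools import combinations

def tvd_sup(p, q):
    """sup over all events A of |Q[A] - P[A]|, by brute force over subsets of outcomes."""
    n = len(p)
    best = 0.0
    for size in range(n + 1):
        for event in combinations(range(n), size):
            gap = abs(sum(q[i] for i in event) - sum(p[i] for i in event))
            best = max(best, gap)
    return best

def tvd_half_l1(p, q):
    """Half the L1 distance, which equals the supremum above for discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))
```

The supremum is attained on the event $A = \{i : p_i > q_i\}$ (or its complement), which explains the factor one half.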

Bayesian prediction. This result can immediately be utilized for Bayesian
prediction. Let $\mathcal{M} := \{Q_1,Q_2,Q_3,...\}$ be a countable (finite or infinite) class of probability
measures, and $\mathrm{Bayes}[A] := \sum_{Q\in\mathcal{M}} Q[A] w_Q$ with $w_Q > 0\ \forall Q$ and $\sum_{Q\in\mathcal{M}} w_Q = 1$. If
the model assumption $P\in\mathcal{M}$ holds, then obviously $P \ll \mathrm{Bayes}$, hence Bayes merges
with $P$, i.e. $d(P,\mathrm{Bayes}|x) \to 0$ w.p.1 for all $P\in\mathcal{M}$. Unlike many other Bayesian
convergence and consistency theorems, no (independence, ergodicity, stationarity,
identifiability, or other) assumption on the model class $\mathcal{M}$ needs to be made. Good
convergence rates for the weaker $d_{h<\infty}$ distances have also been shown [Hut03]. The
analogous result for MDL is as follows:

Theorem 1 (MDL predictions) Let $\mathcal{M}$ be a countable class of probability
measures on $\mathcal{X}^\infty$ containing the unknown true sampling distribution $P$. No
(independence, ergodicity, stationarity, identifiability, or other) assumptions need to be made
on $\mathcal{M}$. Let

$\mathrm{MDL}^x := \arg\min_{Q\in\mathcal{M}} \{-\log Q(x) + K(Q)\}$   with   $\sum_{Q\in\mathcal{M}} 2^{-K(Q)} < \infty$

be the measure selected by MDL at time $\ell$ given $x\in\mathcal{X}^\ell$. Then the predictive
distributions $\mathrm{MDL}^x[\cdot|x]$ converge to $P[\cdot|x]$ in the sense that

$d(P,\mathrm{MDL}^x|x) \equiv \sup_{A\in\mathcal{F}} |\mathrm{MDL}^x[A|x] - P[A|x]| \to 0$ for $\ell(x)\to\infty$ w.p.1
$K(Q)$ is usually interpreted and defined as the length of some prefix code for
$Q$, in which case $\sum_Q 2^{-K(Q)} \le 1$. If $K(Q) := \log_2 w_Q^{-1}$ is chosen as complexity,
by Bayes rule $\Pr(Q|x) = Q(x) w_Q/\mathrm{Bayes}(x)$, the maximum a posteriori estimate
$\mathrm{MAP}^x := \arg\max_{Q\in\mathcal{M}}\{\Pr(Q|x)\} \equiv \mathrm{MDL}^x$. Hence the theorem also applies to MAP.
The proof of the theorem is surprisingly subtle and complex compared to the
analogous Bayesian case. One reason is that $\mathrm{MDL}^x(x)$ is not a measure on $\mathcal{X}^\infty$.
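
The identity between MAP and MDL under $K(Q) = \log_2 w_Q^{-1}$ can be seen in a few lines. The sketch below assumes a finite class given by hypothetical lists of strictly positive likelihoods $Q_i(x)$ and prior weights $w_i$; the function names are illustrative.

```python
import math

def mdl_index(likelihoods, weights):
    """argmin_i { -log2 Q_i(x) + K(Q_i) } with K(Q_i) = log2(1/w_i)."""
    lengths = [-math.log2(q) + math.log2(1.0 / w) for q, w in zip(likelihoods, weights)]
    return min(range(len(lengths)), key=lengths.__getitem__)

def map_index(likelihoods, weights):
    """argmax_i of the posterior Q_i(x) * w_i / Bayes(x)."""
    bayes = sum(q * w for q, w in zip(likelihoods, weights))
    posterior = [q * w / bayes for q, w in zip(likelihoods, weights)]
    return max(range(len(posterior)), key=posterior.__getitem__)
```

Both functions order the models by the product $Q_i(x) w_i$, since the normalizer $\mathrm{Bayes}(x)$ does not depend on $i$ and $-\log_2$ is monotone decreasing.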

Arbitrary $\mathcal{X}$. For arbitrary $\mathcal{X}$, definitions are more subtle. The casual reader
satisfied with countable $\mathcal{X}$ can skip this paragraph. We can consider even more
generally $x_t\in\mathcal{X}_t$ [BD62]. Let $\mathcal{B}_t$ be a $\sigma$-field of subsets of $\mathcal{X}_t$ for $t = 1,2,3,...$. Let $\mathcal{F}_\ell$
be the $\sigma$-field for $\mathcal{X}^\ell := \mathcal{X}_1\times...\times\mathcal{X}_\ell$ generated by (i.e. the smallest $\sigma$-field containing)
$\mathcal{B}_1\times...\times\mathcal{B}_\ell$ for $\ell\le\infty$. Let $(\Omega := \mathcal{X}^\infty, \mathcal{F} \equiv \mathcal{F}_\infty, P)$ be a probability space. Let $P_\ell$ be
the marginal distribution on $(\mathcal{X}^\ell,\mathcal{F}_\ell)$, i.e. $P_\ell[A] := P[A\times\mathcal{X}_{\ell+1}\times\mathcal{X}_{\ell+2}\times...]$ for $A\in\mathcal{F}_\ell$.
The predictive distribution $P^\ell[A|x_{1:\ell}]$ is (a version of) the conditional distribution of
the future "$x_{\ell+1:\infty}$" given past $x_{1:\ell}$, implicitly defined by $\int P^\ell[A|x_{1:\ell}]\,dP_\ell(x_{1:\ell}) := P[A]$
$\forall A\in\mathcal{F}$. Similarly define $Q_\ell$ and $Q^\ell$ for the other $Q\in\mathcal{M}$. See [Doo53] for details.

Let $M$ be a measure on $\Omega$ such that $Q$ is absolutely continuous (see above)
relative to $M$ for all $Q\in\mathcal{M}$. For instance $M[\cdot] = \mathrm{Bayes}[\cdot]$ has this property. Now
define the density (Radon-Nikodym derivative) $Q_\ell(x_{1:\ell})$ (round brackets) of measure
$Q_\ell[\cdot]$ (square brackets) relative to $M_\ell[\cdot]$. It is important to note that all essential
quantities, in particular $\mathrm{MDL}^x$, are independent of the particular choice of $M$. We
therefore plainly speak of the $Q$-density or even $Q$-probability of $x$.

For countable $\mathcal{X}$ and counting measure $M$, $Q^\ell[A|x]$ and $Q_\ell(x)$ coincide with
$Q[A|x]$ and $Q(x)$ above. In the following, we drop the sub- and superscripts $\ell$, since
they will always be clear from the argument. Note that by Carathéodory's extension
theorem, $\{Q(x) : x\in\mathcal{X}^*\}$ uniquely defines $Q[A]\ \forall A\in\mathcal{F}$.

4 Proof for Finite Model Class 

We first prove Theorem 1 for finite model classes $\mathcal{M}$. For this we need the following
Definition and Lemma:

Definition 2 (Relations between Q and P) For any probability measures $Q$
and $P$, let

• $Q^r + Q^s = Q$ be the Lebesgue decomposition of $Q$ relative to $P$ into an absolutely
continuous non-negative measure $Q^r \ll P$ and a singular non-negative measure
$Q^s \perp P$.

• $g(\omega) := dQ^r/dP = \lim_{\ell\to\infty} [Q(x_{1:\ell})/P(x_{1:\ell})]$ be (a version of) the
Radon-Nikodym derivative, i.e. $Q^r[A] = \int_A g\,dP$.

• $\Omega^\circ := \{\omega : Q(x_{1:\ell})/P(x_{1:\ell}) \to 0\} \equiv \{\omega : g(\omega) = 0\}$.

• $\vec\Omega := \{\omega : d(P,Q|x) \to 0$ for $\ell(x)\to\infty\}$.
It is well-known that the Lebesgue decomposition exists and is unique. The
representation of the Radon-Nikodym derivative as a limit of local densities can e.g.
be found in [Doo53, VII§8]: $Z^{r/s}_\ell(\omega) := Q^{r/s}(x_{1:\ell})/P(x_{1:\ell})$ for $\ell = 1,2,3,...$ constitute
two martingale sequences, which converge w.p.1. $Q^r \ll P$ implies that the limit $Z^r_\infty$
is the Radon-Nikodym derivative $dQ^r/dP$. (Indeed, Doob's martingale convergence
theorem can be used to prove the Radon-Nikodym theorem.) $Q^s \perp P$ implies $Z^s_\infty = 0$
w.p.1. So $g$ is uniquely defined and finite w.p.1.

Lemma 3 (Generalized merging of opinions) For any $Q$ and $P$, the following
holds:

(i) $P \ll Q$ if and only if $P[\Omega^\circ] = 0$

(ii) $P[\Omega^\circ] = 0$ implies $P[\vec\Omega] = 1$     [(i)+[BD62]]

(iii) $P[\Omega^\circ\cup\vec\Omega] = 1$     [generalizes (ii)]

(i) says that $Q(x)/P(x)$ converges almost surely to a strictly positive value if and
only if $P$ is absolutely continuous relative to $Q$, (ii) says that an almost sure positive
limit of $Q(x)/P(x)$ implies that $Q$ merges with $P$. (iii) says that even if $P \not\ll Q$,
we still have $d(P,Q|x) \to 0$ on almost every sequence that has a positive limit of
$Q(x)/P(x)$.

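Proof. Recall Definition 2.

(i$\Leftarrow$) Assume $P[\Omega^\circ] = 0$: $P[A] > 0$ implies $Q[A] \ge Q^r[A] = \int_A g\,dP > 0$, since $g > 0$
a.s. by assumption $P[\Omega^\circ] = 0$. Therefore $P \ll Q$.

(i$\Rightarrow$) Assume $P \ll Q$: Choose a $B$ for which $P[B] = 1$ and $Q^s[B] = 0$. Now
$Q^r[\Omega^\circ] = \int_{\Omega^\circ} g\,dP = 0$ implies $0 \le Q[B\cap\Omega^\circ] \le Q^s[B] + Q^r[\Omega^\circ] = 0 + 0$. By $P \ll Q$ this
implies $P[B\cap\Omega^\circ] = 0$, hence $P[\Omega^\circ] = 0$.

(ii) That $P \ll Q$ implies $P[\vec\Omega] = 1$ is Blackwell-Dubins' celebrated result. The
result now follows from (i).

(iii) generalizes [BD62]. For $P[\Omega^\circ] = 0$ it reduces to (ii). The case $P[\Omega^\circ] = 1$ is
trivial. Therefore we can assume $0 < P[\Omega^\circ] < 1$. Consider measure $P'[A] := P[A|B]$
conditioned on $B := \Omega\setminus\Omega^\circ$.

Assume $Q[A] = 0$. Using $\int_{\Omega^\circ} g\,dP = 0$, we get $0 = Q^r[A] = \int_A g\,dP = \int_{A\setminus\Omega^\circ} g\,dP$.
Since $g > 0$ outside $\Omega^\circ$, this implies $P[A\setminus\Omega^\circ] = 0$. So $P'[A] = P[A\cap B]/P[B] =
P[A\setminus\Omega^\circ]/P[B] = 0$. Hence $P' \ll Q$. Now (ii) implies $d(P',Q|x) \to 0$ with $P'$-probability
1. Since $P' \ll P$ we also get $d(P',P|x) \to 0$ w.P'.p.1.

Together this implies $0 \le d(P,Q|x) \le d(P',P|x) + d(P',Q|x) \to 0$ w.P'.p.1, i.e.
$P'[\vec\Omega] = 1$. The claim now follows from

$P[\Omega^\circ\cup\vec\Omega] = P'[\Omega^\circ\cup\vec\Omega]\,P[\Omega\setminus\Omega^\circ] + P[\Omega^\circ\cup\vec\Omega\,|\,\Omega^\circ]\,P[\Omega^\circ]$

$= 1\cdot P[\Omega\setminus\Omega^\circ] + 1\cdot P[\Omega^\circ] = P[\Omega] = 1$ ■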

The intuition behind the proof of Theorem 1 is as follows. MDL will
asymptotically not select $Q$ for which $Q(x)/P(x) \to 0$. Hence for those $Q$ potentially selected
by MDL, we have $\omega\notin\Omega^\circ$, hence $\omega\in\vec\Omega$, for which $d(P,Q|x) \to 0$ (a.s.). The technical
difficulties are for finite $\mathcal{M}$ that the eligible $Q$ depend on the sequence $\omega$, and for
infinite $\mathcal{M}$ to deal with non-uniformly converging $d$, i.e. to infer $d(P,\mathrm{MDL}^x|x) \to 0$.

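Proof of Theorem 1 for finite $\mathcal{M}$. Recall Definition 2, and let $g_Q, \Omega^\circ_Q, \vec\Omega_Q$ refer to
some $Q\in\mathcal{M} \equiv \{Q_1,...,Q_m\}$. The set of sequences $\omega$ for which $g_Q$ for some $Q\in\mathcal{M}$
is undefined has $P$-measure zero, and hence can be ignored. Fix some sequence $\omega\in\Omega$
for which $g_Q(\omega)$ is defined for all $Q\in\mathcal{M}$, and let $\mathcal{M}_\omega := \{Q\in\mathcal{M} : g_Q(\omega) = 0\}$.

$\mathrm{MDL}^x := \arg\min_{Q\in\mathcal{M}} L_Q(x)$, where $L_Q(x) := -\log Q(x) + K(Q)$.

Consider the difference

$L_Q(x) - L_P(x) = -\log\frac{Q(x)}{P(x)} + K(Q) - K(P) \ \xrightarrow{\ell\to\infty}\ -\log g_Q(\omega) + K(Q) - K(P)$

For $Q\in\mathcal{M}_\omega$, the r.h.s. is $+\infty$, hence

$\forall Q\in\mathcal{M}_\omega\ \exists\ell_Q\ \forall\ell>\ell_Q : L_Q(x) > L_P(x)$

Since $\mathcal{M}$ is finite, this implies

$\forall\ell>\ell_0\ \forall Q\in\mathcal{M}_\omega : L_Q(x) > L_P(x)$, where $\ell_0 := \max\{\ell_Q : Q\in\mathcal{M}_\omega\} < \infty$

Therefore, since $P\in\mathcal{M}$, we have $\mathrm{MDL}^x\notin\mathcal{M}_\omega\ \forall\ell>\ell_0$, so we can safely ignore all
$Q\in\mathcal{M}_\omega$ and focus on $Q\in\bar{\mathcal{M}}_\omega := \mathcal{M}\setminus\mathcal{M}_\omega$. Let $\Omega_1 := \bigcap_{Q\in\mathcal{M}} (\Omega^\circ_Q\cup\vec\Omega_Q)$. Since
$P[\Omega_1] = 1$ by Lemma 3(iii), we can also assume $\omega\in\Omega_1$.

$Q\in\bar{\mathcal{M}}_\omega \ \Rightarrow\ g_Q(\omega) > 0 \ \Rightarrow\ \omega\notin\Omega^\circ_Q \ \Rightarrow\ \omega\in\vec\Omega_Q \ \Rightarrow\ d(P,Q|x) \to 0$

This implies $d(P,\mathrm{MDL}^x|x) \le \sup_{Q\in\bar{\mathcal{M}}_\omega} d(P,Q|x) \to 0$

where the inequality holds for $\ell > \ell_0$ and the limit holds, since $\mathcal{M}$ is finite. Since
the set of $\omega$ excluded in our considerations has measure zero, $d(P,\mathrm{MDL}^x|x) \to 0$
w.p.1, which proves the theorem for finite $\mathcal{M}$. ■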



5 Proof for Countable Model Class 

The proof in the previous Section crucially exploited finiteness of $\mathcal{M}$. We want to
prove that the probability that MDL asymptotically selects "complex" $Q$ is small.
The following Lemma establishes that the probability that MDL selects a specific
complex $Q$ infinitely often is small.

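Lemma 4 (MDL avoids complex probability measures Q) For any $Q$ and
$P$ we have $P[Q(x)/P(x) \ge c$ infinitely often$] \le 1/c$.

Proof.

$P\big[\forall\ell_0\ \exists\ell>\ell_0 : \frac{Q(x)}{P(x)} \ge c\big] \ \overset{(a)}{\le}\ P\big[\overline{\lim}_{\ell\to\infty} \frac{Q(x)}{P(x)} \ge c\big] \ \overset{(b)}{\le}\ \frac{1}{c} E\big[\overline{\lim}_{\ell\to\infty} \frac{Q(x)}{P(x)}\big]$

$\overset{(c)}{=}\ \frac{1}{c} E\big[\lim_{\ell\to\infty} \frac{Q(x)}{P(x)}\big] \ \overset{(d)}{\le}\ \frac{1}{c} \lim_{\ell\to\infty} E\big[\frac{Q(x)}{P(x)}\big] \ \overset{(e)}{\le}\ \frac{1}{c}$

(a) is true by definition of the limit superior $\overline{\lim}$, (b) is Markov's inequality, (c)
exploits the fact that the limit of $Q(x)/P(x)$ exists w.p.1, (d) uses Fatou's lemma,
and (e) is obvious, since $E[Q(x)/P(x)] \le 1$ for every $\ell$. ■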
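
A finite-horizon analogue of Lemma 4 can be verified exactly on a toy example. The sketch below is not the Fatou argument of the proof: it checks the related maximal inequality $P[\exists \ell\le L : Q(x_{1:\ell})/P(x_{1:\ell}) \ge c] \le 1/c$ (Ville's inequality for the non-negative supermartingale $Q/P$), assuming i.i.d. Bernoulli $P$ and $Q$ and a horizon small enough to enumerate all sequences. The function name and parameters are illustrative.

```python
from itertools import product

def prob_ratio_ever_exceeds(p, q, c, length):
    """Exact P-probability that Q(x_{1:l})/P(x_{1:l}) >= c for some l <= length,
    for P = i.i.d. Bernoulli(p) and Q = i.i.d. Bernoulli(q), by enumerating
    all binary sequences of the given length."""
    total = 0.0
    for seq in product((0, 1), repeat=length):
        ratio, prob, hit = 1.0, 1.0, False
        for bit in seq:
            p_bit = p if bit else 1 - p
            q_bit = q if bit else 1 - q
            ratio *= q_bit / p_bit  # likelihood ratio of the prefix read so far
            prob *= p_bit           # P-probability of the prefix read so far
            if ratio >= c:
                hit = True
        if hit:
            total += prob
    return total
```

The event "exceeds $c$ at some time up to $L$" contains the finite-horizon part of "exceeds $c$ infinitely often", so this check is in the same spirit as the lemma while remaining a different, finite statement.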

For sufficiently complex $Q$, Lemma 4 implies that $L_Q(x) > L_P(x)$ for most $x$.
Since convergence is non-uniform in $Q$, we cannot apply the Lemma to all (infinitely
many) complex $Q$ directly, but need to lump them into one $\bar Q$.

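Proof of Theorem 1 for countable $\mathcal{M}$. Let the $Q\in\mathcal{M} = \{Q_1,Q_2,...\}$ be ordered
somehow, e.g. in increasing order of complexity $K(Q)$, and $P = Q_n$. Choose some
(large) $m > n$ and let $\widetilde{\mathcal{M}} := \{Q_{m+1},Q_{m+2},...\}$ be the set of "complex" $Q$. We show
that the probability that MDL selects infinitely often complex $Q$ is small:

$P[\mathrm{MDL}^x\in\widetilde{\mathcal{M}}$ infinitely often$]$

$\equiv P[\forall\ell_0\ \exists\ell>\ell_0 : \mathrm{MDL}^x\in\widetilde{\mathcal{M}}]$

$\le P[\forall\ell_0\ \exists\ell>\ell_0 \wedge Q\in\widetilde{\mathcal{M}} : L_Q(x) \le L_P(x)]$

$= P\big[\forall\ell_0\ \exists\ell>\ell_0 : \sup_{i>m} \frac{Q_i(x)}{P(x)} 2^{K(P)-K(Q_i)} \ge 1\big]$

$\overset{(a)}{\le} P\big[\forall\ell_0\ \exists\ell>\ell_0 : \delta\,\frac{\bar Q(x)}{P(x)}\, 2^{K(P)} \ge 1\big]$

$\overset{(b)}{\le} \delta\, 2^{K(P)} \ \overset{(c)}{\le}\ \varepsilon$

The first three relations follow immediately from the definition of the various
quantities. Bound (a) is the crucial "lumping" step. First we bound

$\sup_{i>m} \frac{Q_i(x)}{P(x)} 2^{-K(Q_i)} \ \le\ \sum_{i=m+1}^\infty \frac{Q_i(x)}{P(x)} 2^{-K(Q_i)} \ =\ \delta\,\frac{\bar Q(x)}{P(x)}$,

$\delta := \sum_{i>m} 2^{-K(Q_i)} < \infty$,     $\bar Q(x) := \frac{1}{\delta} \sum_{i>m} Q_i(x)\, 2^{-K(Q_i)}$

While $\mathrm{MDL}^{\cdot}[\cdot]$ is not a (single) measure on $\Omega$ and hence difficult to deal with, $\bar Q$
is a proper probability measure on $\Omega$. In a sense, this step reduces MDL to Bayes.
Now we apply Lemma 4 in (b) to the (single) measure $\bar Q$. The bound (c) holds
for sufficiently large $m = m_\varepsilon(P)$, since $\delta \to 0$ for $m\to\infty$. This shows that for the
sequence of MDL estimates

$\{\mathrm{MDL}^{x_{1:\ell}} : \ell > \ell_0\} \subseteq \{Q_1,...,Q_m\}$ with probability at least $1-\varepsilon$

Hence the already proven Theorem 1 for finite $\mathcal{M}$ implies that $d(P,\mathrm{MDL}^x|x) \to 0$
with probability at least $1-\varepsilon$. Since convergence holds for every $\varepsilon > 0$, it holds
w.p.1. ■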
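
To make step (c) concrete, the sketch below picks a cutoff $m$ under the illustrative codelength $K(Q_i) = 2\log_2 i$ from the introduction, for which $2^{-K(Q_i)} = 1/i^2$ and $2^{K(P)} = n^2$ when $P = Q_n$. It uses the elementary tail bound $\sum_{i>m} 1/i^2 < 1/m$, so the returned $m$ is sufficient but not minimal. Function names are hypothetical.

```python
import math

def complex_cutoff(n, eps):
    """Return an m > n with delta * 2^K(P) <= eps, for K(Q_i) = 2*log2(i), P = Q_n.

    Since delta = sum_{i>m} 1/i^2 < 1/m, any m >= n^2 / eps suffices.
    """
    return max(n + 1, math.ceil(n * n / eps))

def delta(m):
    """delta = sum_{i>m} 1/i^2, computed as pi^2/6 minus the partial sum up to m."""
    return math.pi**2 / 6 - sum(1.0 / i**2 for i in range(1, m + 1))
```

For example, with $n = 3$ and $\varepsilon = 0.25$ the cutoff is $m = 36$, so all but the first 36 models are lumped into $\bar Q$.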
6 Implications

Due to its generality, Theorem 1 can be applied to many problem classes. We
illustrate some immediate implications of Theorem 1 for time-series forecasting,
classification, regression, discriminative learning, and reinforcement learning.

Time-series forecasting. Classical online sequence prediction is concerned with
predicting $x_{\ell+1}$ from (non-i.i.d.) sequence $x_{1:\ell}$ for $\ell = 1,2,3,...$. Forecasting farther
into the future is possible by predicting $x_{\ell+1:\ell+h}$ for some $h > 0$. One can show that
$0 \le d_1 \le d_h \le d_{h+1} \le d_\infty = 2d \le 2$, see (1) and (3). Hence Theorem 1 implies good
asymptotic (multi-step) predictions. Offline learning is concerned with training a
predictor on $x_{1:\ell}$ for fixed $\ell$ in-house, and then selling and using the predictor on
$x_{\ell+1:\infty}$ without further learning. Theorem 1 shows that for enough training data,
predictions "post-learning" will be good.

Classification and Regression. In classification (discrete $\mathcal{X}$) and regression
(continuous $\mathcal{X}$), a sample is a set of pairs $D = \{(\dot y_1,\dot x_1),...,(\dot y_\ell,\dot x_\ell)\}$, and a functional
relationship $\dot x = f(\dot y)$+noise, i.e. a conditional probability $P(\dot x|\dot y)$ shall be learned.
For reasons apparent below, we have swapped the usual role of $\dot x$ and $\dot y$. The dots
indicate $\dot x\in\mathcal{X}$ and $\dot y\in\mathcal{Y}$, while $x = x_{1:\ell}\in\mathcal{X}^\ell$ and $y = y_{1:\ell}\in\mathcal{Y}^\ell$. If we assume that
also $\dot y$ follows some distribution, and start with a countable model class $\mathcal{M}$ of joint
distributions $Q(\dot x,\dot y)$ which contains the true joint distribution $P(\dot x,\dot y)$, our main
result implies that $\mathrm{MDL}^D[(\dot x,\dot y)|D]$ converges to the true distribution $P(\dot x,\dot y)$. Indeed
since/if samples are assumed i.i.d., we don't need to invoke our general result.

Discriminative learning. Instead of learning a generative [Jeb03] joint
distribution $P(\dot x,\dot y)$, which requires model assumptions on the input $\dot y$, we can
discriminatively [LSS07] learn $P(\cdot|\dot y)$ directly without any assumption on $y$ (not even i.i.d).
We can simply treat $y_{1:\infty}$ as an oracle to all $Q$, define $\mathcal{M}' = \{Q'\}$ with
$Q'(x) := Q(x|y_{1:\infty})$, and apply our main result to $\mathcal{M}'$, leading to $\mathrm{MDL}'^x[A|x] \to P'[A|x]$, i.e.
$\mathrm{MDL}^{x|y_{1:\infty}}[A|x,y_{1:\infty}] \to P[A|x,y_{1:\infty}]$. This is not yet useful since $y_{1:\infty}$ is never known
completely. If $x_1,x_2,...$ are conditionally independent, we can write

$Q(x|y) = \prod_{t=1}^\ell Q(x_t|y_t) = \sum_{x_{\ell+1:m}} \prod_{t=1}^m Q(x_t|y_t) = \sum_{x_{\ell+1:m}} Q(x_{1:m}|y_{1:m}) = Q(x_{1:\ell}|y_{1:m})$

Taking the limit $m\to\infty$ we get $Q(x|y) = Q(x|y_{1:\infty})$. This is a generic property
satisfied for all causal processes, that a future $y_t$ for $t > \ell$ does not influence past
observations $x_{1:\ell}$. Hence for a class of conditionally independent distributions, we
get $\mathrm{MDL}^{x|y}[A|x,y] \to P[A|x,y]$. Since the $x$ given $y$ are not identically distributed,
classical MDL consistency results for i.i.d. or stationary-ergodic sources do not apply.
The following corollary formalizes our findings:

Corollary 5 (Discriminative MDL) Let $\mathcal{M}\ni P$ be a class of discriminative
causal distributions $Q[\cdot|y_{1:\infty}]$, i.e. $Q(x|y_{1:\infty}) = Q(x|y)$, where $x = x_{1:\ell}$ and $y = y_{1:\ell}$.
Regression and classification are typical examples. Further assume $\mathcal{M}$ is countable.
Let $\mathrm{MDL}^{x|y} := \arg\min_{Q\in\mathcal{M}} \{-\log Q(x|y) + K(Q)\}$ be the discriminative MDL
measure (at time $\ell$ given $x,y$). Then $\sup_A |\mathrm{MDL}^{x|y}[A|x,y] - P[A|x,y]| \to 0$ for $\ell(x)\to\infty$,
$P[\cdot|y_{1:\infty}]$ almost surely, for every sequence $y_{1:\infty}$.

For finite $\mathcal{Y}$ and conditionally independent $x$, the intuitive reason why this can
work is as follows: If $\dot y$ appears in $y_{1:\infty}$ only finitely often, it plays asymptotically
no role; if it appears infinitely often, then $P(\cdot|\dot y)$ can be learned. For infinite $\mathcal{Y}$ and
deterministic $\mathcal{M}$, the result is also intelligible: Every $\dot y$ might appear only once, but
probing enough function values $x_t = f(y_t)$ allows the function to be identified.

Reinforcement learning (RL). In the agent framework [RN03], an agent interacts
with an environment in cycles. At time $t$, an agent chooses an action $y_t$ based on
past experience $x_{<t} \equiv (x_1,...,x_{t-1})$ and past actions $y_{<t}$ with probability $\pi(y_t|x_{<t}y_{<t})$
(say). This leads to a new perception $x_t$ with probability $\mu(x_t|x_{<t}y_{1:t})$ (say). Then
cycle $t+1$ starts. Let $P(xy) = \prod_{t=1}^\ell \mu(x_t|x_{<t}y_{1:t})\,\pi(y_t|x_{<t}y_{<t})$ be the joint interaction
probability. We make no (Markov, stationarity, ergodicity) assumption on $\mu$ and $\pi$.
They may be POMDPs or beyond.

Corollary 6 (Single-agent MDL) For a fixed policy=agent $\pi$, and a class of
environments $\{\nu_1,\nu_2,...\}\ni\mu$, let $\mathcal{M} = \{Q_i\}$ with $Q_i(x|y) = \prod_{t=1}^\ell \nu_i(x_t|x_{<t}y_{1:t})$. Then
$d(P[\cdot|y],\mathrm{MDL}^{x|y}) \to 0$ with joint $P$-probability 1.

The corollary follows immediately from the previous corollary and the facts that
the $Q_i$ are causal and that with $P[\cdot|y_{1:\infty}]$-probability 1 $\forall y_{1:\infty}$ implies w.P.p.1 jointly
in $x$ and $y$.

In reinforcement learning [SB98], the perception $x_t := (o_t,r_t)$ consists of some
regular observation $o_t$ and a reward $r_t\in[0,1]$. The goal is to find a policy which maximizes
accrued reward in the long run. The previous corollary implies

Corollary 7 (Fixed-policy MDL value function convergence) Let
$V_P[xy] := E_{P[\cdot|xy]}[r_{\ell+1} + \gamma r_{\ell+2} + \gamma^2 r_{\ell+3} + ...]$ be the future $\gamma$-discounted $P$-expected reward sum
(true value of $\pi$), and similarly $V_{Q_i}[xy]$ for $Q_i$. Then the MDL value converges to
the true value, i.e. $V_{\mathrm{MDL}^{x|y}}[xy] - V_P[xy] \to 0$, w.P.p.1. for any policy $\pi$.

Proof. The corollary follows from the general inequality (valid for non-negative $f$)

$|E_P[f] - E_Q[f]| \le \sup|f| \cdot \sup_A |P[A] - Q[A]|$

by inserting $f := r_{\ell+1} + \gamma r_{\ell+2} + \gamma^2 r_{\ell+3} + ...$ and $P = P[\cdot|xy]$ and $Q = \mathrm{MDL}^{x|y}[\cdot|xy]$, and
using $0 \le f \le 1/(1-\gamma)$ and Corollary 6. ■
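
The inequality used in this proof can be sanity-checked on random finite examples. The sketch below assumes discrete distributions on a handful of outcomes and a non-negative bounded $f$ (as is the case for the discounted reward sum); the function name, seed, and sizes are illustrative.

```python
import random

def check_value_bound(num_outcomes=6, trials=1000, seed=0):
    """Check |E_P[f] - E_Q[f]| <= sup|f| * sup_A |P[A] - Q[A]| on random discrete
    examples with non-negative f. Returns True if the bound held in every trial."""
    rng = random.Random(seed)

    def random_distribution():
        raw = [rng.random() + 1e-9 for _ in range(num_outcomes)]
        total = sum(raw)
        return [r / total for r in raw]

    for _ in range(trials):
        p, q = random_distribution(), random_distribution()
        f = [rng.random() for _ in range(num_outcomes)]  # 0 <= f < 1
        gap = abs(sum(fi * (pi - qi) for fi, pi, qi in zip(f, p, q)))
        # for discrete distributions the supremum over events is half the L1 distance
        tvd = 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))
        if gap > max(f) * tvd + 1e-12:
            return False
    return True
```

The non-negativity of $f$ matters here: for signed $f$ the same argument only gives the bound with an extra factor of 2.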

Since the value function probes the infinite future, we really made use of our
convergence result in total variation. Corollary 7 shows that MDL approximates
the true value asymptotically arbitrarily well. The result is weaker than it may
appear. Following the policy that maximizes the estimated (MDL) value is often
not a good idea, since the policy does not explore properly [Hut05]. Nevertheless,
it is a reassuring non-trivial result.

7 Variations 

MDL is more a general principle for model selection than a uniquely defined
procedure. For instance, there are crude and refined MDL [Grü07], the related MML
principle [Wal05], a static, a dynamic, and a hybrid way of using MDL for
prediction [PH05], and other variations. For our setup, we could have defined multi-step
lookahead prediction as a product of single-step predictions:

$\mathrm{MDLI}(x_{1:\ell}) := \prod_{t=1}^\ell \mathrm{MDL}^{x_{<t}}(x_t|x_{<t})$,     $\mathrm{MDLI}(z|x) = \mathrm{MDLI}(xz)/\mathrm{MDLI}(x)$

which is a more incremental MDL version. Both $\mathrm{MDL}^x$ and MDLI are 'static' in
the sense of [PH05], and each allows for a dynamic and a hybrid version. Due to its
incremental nature, MDLI likely has better predictive properties than $\mathrm{MDL}^x$, and
conveniently defines a single measure over $\mathcal{X}^\infty$, but inconveniently is $\notin\mathcal{M}$. One
reason for using MDL is that it can be computationally simpler than Bayes. E.g. if
$\mathcal{M}$ is a class of MDPs, then $\mathrm{MDL}^x$ is still an MDP and hence tractable, but MDLI,
like Bayes, is a nightmare to deal with.

Acknowledgements. My thanks go to Peter Sunehag for useful discussions. 